A Character-based Approach to Distributional Semantic Models: Exploiting Kanji Characters for Constructing Japanese Word Vectors
Abstract
Many Japanese words are made of kanji characters, which themselves represent meanings. However, traditional word-based distributional semantic models (DSMs) do not benefit from the useful semantic information of kanji characters. In this paper, we propose a method for exploiting the semantic information of kanji characters for constructing Japanese word vectors in DSMs. In the proposed method, the semantic representations of kanji characters (i.e., kanji vectors) are constructed first using the techniques of DSMs, and then word vectors are computed by combining the vectors of constituent kanji characters using vector composition methods. The evaluation experiment using a synonym identification task demonstrates that the kanji-based DSM achieves the best performance when a kanji-kanji matrix is weighted by positive pointwise mutual information and word vectors are composed by weighted multiplication. Comparison between kanji-based DSMs and word-based DSMs reveals that our kanji-based DSMs generally outperform latent semantic analysis and also surpass the best-scoring word-based DSM for infrequent words comprising only frequent kanji characters. These findings clearly indicate that kanji-based DSMs are beneficial for improving the quality of Japanese word vectors.
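The best-performing configuration named in the abstract (a kanji-kanji matrix weighted by positive pointwise mutual information, with word vectors composed by weighted multiplication) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy co-occurrence counts, the function names, and in particular the form of weighted multiplication (element-wise product of the two kanji vectors raised to the exponents `alpha` and `beta`), whose exact definition and weights the abstract does not give.

```python
import numpy as np


def ppmi(counts):
    """Weight a co-occurrence count matrix by positive PMI.

    PMI(i, j) = log2( P(i, j) / (P(i) * P(j)) ); negative and
    undefined values are replaced by 0.
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    expected = row * col / total  # counts expected under independence
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(counts / expected)
    pmi[~np.isfinite(pmi)] = 0.0  # zero counts or empty rows/columns
    return np.maximum(pmi, 0.0)


def compose_weighted_multiplication(u, v, alpha=0.5, beta=0.5):
    """Assumed form of weighted multiplication: p_i = u_i**alpha * v_i**beta.

    PPMI vectors are non-negative, so the powers are well defined.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.power(u, alpha) * np.power(v, beta)


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0


# Toy kanji-kanji co-occurrence counts (invented numbers, illustration only).
kanji = ["学", "校", "生", "先"]
toy_counts = np.array([
    [0, 8, 6, 1],
    [8, 0, 2, 1],
    [6, 2, 0, 7],
    [1, 1, 7, 0],
])
kanji_vectors = dict(zip(kanji, ppmi(toy_counts)))

# Word vectors for two-kanji words, built from their constituent kanji.
gakkou = compose_weighted_multiplication(kanji_vectors["学"], kanji_vectors["校"])
gakusei = compose_weighted_multiplication(kanji_vectors["学"], kanji_vectors["生"])
similarity = cosine(gakkou, gakusei)
```

In a synonym identification setting, a similarity score like `similarity` would be computed between a target word and each candidate, and the highest-scoring candidate would be chosen; the exponents would be tuned on held-out data rather than fixed at 0.5.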
Similar resources
Contribution of sublexical information to word meaning: An objective approach using latent semantic analysis and corpus analysis on predicates
Past studies have employed a subjective rating/categorization methodology to investigate whether radicals, an example of sub-lexical visual information in Chinese/kanji, contribute to computation of character/word meaning, with conflicting results. This study took an objective, corpus-based approach for the first time. Specifically, we conducted a Latent Semantic Analysis based on Japanese news...
Substroke Approach to HMM-Based On-line Kanji Handwriting Recognition
A new method is proposed for on-line handwriting recognition of Kanji characters. The method employs substroke HMMs as minimum units to constitute Japanese Kanji characters and utilizes the direction of pen motion. The main motivation is to fully utilize the continuous speech recognition algorithm by relating sentence speech to Kanji character, phonemes to substrokes, and grammar to Kanji stru...
Neural basis of hierarchical visual form processing of Japanese Kanji characters
INTRODUCTION: We investigated the neural processing of reading Japanese Kanji characters, which involves unique hierarchical visual processing, including the recognition of visual components specific to Kanji, such as "radicals." METHODS: We performed functional MRI to measure brain activity in response to hierarchical visual stimuli containing (1) real Kanji characters (complete structure with...
Normal and impaired reading of Japanese kanji and kana
Two kinds of scripts are used in the written forms of Japanese words: morphographic kanji and phonographic kana. Whereas each kana character invariably represents a single pronunciation, the majority of kanji characters have two or more legitimate pronunciations, with one appropriate to the character in any given word. Furthermore, each kanji character has meaning while a kana character does no...
Document Classification Using Domain Specific Kanji Characters Extracted by χ² Method
In this paper we describe a method of classifying Japanese text documents using domain specific kanji characters. Text documents are generally classified by significant words (keywords) of the documents. However, it is difficult to extract these significant words from Japanese text, because Japanese texts are written without using blank spaces, such as delimiters, and must be segmented into wor...